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Abstract 

The nonsynonymous/synonymous rate ratio (co - d N /d s ) is an important measure of the mode and strength of natural 
selection acting on nonsynonymous mutations in protein-coding genes. The simplest such analysis is the estimation of the 
d N /d s ratio using two sequences. Both heuristic counting methods and the maximum-likelihood (ML) method based on a 
codon substitution model are widely used for such analysis. However, these methods do not have nice statistical prop- 
erties, as the estimates can be zero or infinity in some data sets, so that their means and variances are infinite. In large 
genome-scale comparisons, such extreme estimates (either 0 or oo) of co and sequence distance (t) are common. Here, we 
implement a Bayesian method to estimate co and t in pairwise sequence comparisons. Using a combination of computer 
simulation and real data analysis, we show that the Bayesian estimates have better statistical properties than the ML 
estimates, because the prior on co and t shrinks the posterior of those parameters away from extreme values. We also 
calculate the posterior probability for co > 1 as a Bayesian alternative to the likelihood ratio test. The new method is 
computationally efficient and may be useful for genome-scale comparisons of protein-coding gene sequences. 

Key words: nonsynonymous/synonymous rate ratio, evolutionary distance, Bayesian estimation, pairwise comparisons, 
protein-coding sequences. 



Introduction 

The nonsynonymous/synonymous rate ratio (co - d N /d s ) is an 
important measure of the mode and strength of natural se- 
lection acting on protein-coding genes (Kimura 1977). A 
number of methods have been developed to estimate co 
from pairwise sequence alignments, ranging from heuristic 
counting methods (Li et al. 1985; Nei and Gojobori 1986; 
Yang and Nielsen 2000) to maximum-likelihood (ML) meth- 
ods based on an explicit Markov model of codon evolution 
(Goldman and Yang 1994). ML estimates (MLEs) of co for 
thousands of genes are routinely calculated as descriptive 
statistics in genomic comparisons (Nielsen et al. 2005; Ge 
et al. 2008; Walters and Harrison 2010; Buschiazzo et al. 
2012; Gladieux et al. 2013; Wang and Chen 2013). Although 
the ML method for pairwise comparisons produces reason- 
able estimates of co and sequence distance (t) for most data 
sets, it suffers from a few problems when the data sets are 
extreme. For example, the MLE of co (co) is 0 when the two 
compared sequences have only synonymous differences and 
oo when they have only nonsynonymous differences. 
Similarly, when the sequences are identical, the MLE i is 0 
and co is not unique. When the sequences are very divergent t 
may be oo. 

Because of these infinite or undefined estimates, neither co 
nor i have finite means or variances. Extreme values of co and t 
are commonly encountered in genome-level comparisons of 
thousands of genes, and those extreme estimates cause diffi- 
culties with the calculation of summary statistics (such as 
mean co and t across all genes in the genome). An estimation 



method that always produces finite and reasonable estimates 
for co and t is thus desirable. Here, we develop a Bayesian 
method to calculate the posterior means of co and t between 
two sequences, denoted co and t. Using computer simulation, 
we show that the posterior means of co and t are well behaved 
and have better Frequentist properties than the MLEs. We 
then use ML and the new Bayesian method to estimate co and 
t from pairwise gene alignments for the genomes of four 
mammals (human, chimpanzee, mouse, and rat) and three 
bacterial strains (Escherichia coli 0157:H7, £ coli K-12, and 
Salmonella typhimurium LT2). We show that extreme MLEs 
of co and t are common in these data sets, and that the 
Bayesian method produces finite, well-behaved estimates. 
The new Bayesian method is computationally efficient and 
is implemented in the CODEML program of the PAML pack- 
age (Yang 2007). 

New Bayesian Approach to Estimate co and t 

Here, we summarize the main features of the new Bayesian 
approach. The joint posterior distribution of t and co given the 
data x (the pairwise sequence alignment) is 

f(t,co\x) = ^f(x\t,co)f(t,co), (1) 

where f(x\t,co) is the likelihood or the probability of 
observing the data x given t and co, f(t, co) is the prior and 
C = fff(x\t, co)f(t, co)dtdco is the normalizing constant. The 
posterior is proportional to the product of the likelihood and 
the prior. If the model involves the transition/transversion rate 
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ratio (k), its AALE (/?) is used. If the model involves nucleotide 
or codon frequency parameters, they are estimated using the 
observed frequencies. When the data are informative, the 
likelihood dominates the posterior. When the data are unin- 
formative, the prior may have a strong influence on the pos- 
terior. Here, we use two independent gamma distributions to 
construct the joint prior of t and co: 

f(t,co) = C(t | 1.1,1.1) x GO | 1.1,2.2), (2) 

where the gamma density C(x | a, jS) has mean a/f3 and var- 
iance a/0 2 . Here, the prior means of t and co are 1 and 0.5, 
respectively, and the shape parameter a - 1.1 indicates that 
the priors are quite diffuse. This joint prior has a mode away 
from (0,0) and the prior density decays to 0 as either t or co 
approaches oo, thus penalizing extreme values. The likelihood 
is calculated from a pairwise sequence alignment using a 
codon substitution model (Yang and Nielsen 1998). As 
point estimates of co and t we use their posterior means 

oo oo 

co = E(co | x) = ^ j j cof(x 1 1, co, K)f(t, co)dtdco, (3) 

0 0 

oo oo 

t=E(t\x) = y j tf(x 1 1, co, £)/(t, co)dtdco. (4) 

0 0 

The posterior variances and covariance of co and t can be 
similarly defined and can be calculated using standard nu- 
merical techniques. We use Gaussian quadrature to calculate 
all integrals numerically. We use similar techniques to calcu- 
late P(co > 1 | x), the posterior probability that co > 1, which 
may be compared with the likelihood ratio test (LRT) of the 
null hypothesis H 0 : co - 1 (see Methods and Materials). 

We consider five different scenarios in which the numerical 
calculations of the integrals may differ. We simulated five data 
sets to represent those five scenarios, each consisting of 2 
sequences of 100 codons, with different numbers of synony- 
mous (S d ) and nonsynonymous (N d ) differences. The poste- 
rior and likelihood surfaces for the five cases are shown in 
figure 1. 

Case I: (S d > 0, N d > 0). This is the most common case, 
with both synonymous and nonsynonymous differences ob- 
served. The data are quite informative about co and t and the 
posterior distribution resembles the likelihood (fig. 1A' and 
A). In our example data set, we have S = 73.7, N = 226.3, 
S d = 18.5, N d = 6.5, where S and N are the numbers of synon- 
ymous and nonsynonymous sites. The MLEs are t = 0.30 and 
£ = 0.11 whereas the posterior means are t = 0.31 and 
£ = 0.13. 

Case II: (S d = N d = 0). In this case, the two sequences are 
identical. The likelihood is maximized when t = 0 and when 
t = 0, co has no effect on the likelihood, so the MLE of co is 
not unique (fig. 1B). In our example, S = 73.3, N = 226.7, 
S d = N d = 0. The posterior has a single mode and the posterior 
means are t = 0.011 and £ = 0.496 (fig. ^B f ). Note that the 
posterior mean of co is almost equal to the prior mean, 
since the data are uninformative about co. Also, the posterior 



mean is markedly different from the posterior mode, because 
the posterior distribution is highly skewed. 

Case III: (S d > 0, N d = 0). Only synonymous differences are 
observed. In our example, S = 74.4, N = 225.6, S d = 24 and 
N d = 0. Then, we have i = 0.306 and co = 0 (fig. 1C). The pos- 
terior for co has a mode away from 0 and t = 0.316 and 
£ = 0.014 (fig. 1C). 

Case IV: (S d » 0, N d » 0). Only nonsynonymous differ- 
ences are observed. In our example, S = 73.2, N = 226.8, S d = 0, 
N d = 40. The MLEs are i = 0.48 and £ = oo (fig. 1D). The pos- 
terior has a well-defined mode and thus t = 0.47 and £ = 3.1 
(fig. 1D0. 

Case V: (S d » 0, N d » 0). The two sequences are so di- 
vergent that they look like random sequences (S = 75.9, 
N = 224.1, S d = 75, N d =175). Here, the likelihood increases 
with the increase of both t and co, with the MLEs at t = oo 
and £ = oo (fig. 1E). In the Bayesian analysis, the prior penal- 
izes large values and thus the posterior means are t = 10.31 
and £ = 0.72 (fig. 1E'). Note that the posterior mean of co is 
close to the prior mean, since the data of two nearly random 
sequences are uninformative about co. 

These five cases illustrate how the prior influences the 
posterior depending on whether the data are informative 
or not. The posterior means of t and co are finite for all five 
cases, whereas the MLEs are not. We note that because the 
MLEs of t and co may be infinite, their mean square errors 
(MSEs) are oo as well. The MSEs of the posterior means are in 
contrast always well defined. In this sense, the posterior mean 
estimates have better Frequentist properties than the MLEs. 
In the next section, we study the statistical properties of the 
Bayesian estimates of t and co using simulated and real data, in 
comparison with the MLEs. We calculate the MSEs of the 
MLEs by excluding the infinite estimates. 

Results 

Analysis of Simulated Data 

To examine the statistical properties of the posterior esti- 
mates of t and co, we conducted a computer simulation. 
The program EVOLVER from the PAML package (Yang 
2007) was used to generate pairwise sequence alignments 
of length L c = 500 codons. We used t = 0.1, 0.5, 1, and 5 and 
a; = 0.01, 0.1, 0.5, and 2 (16 combinations) with transition/ 
transversion rate ratio k = 2 and equal codon frequencies 
(1/61) to generate the data sets. The number of replicates 
was 10,000. The simulated data sets were analyzed using both 
ML and the new Bayesian method using the CODEML pro- 
gram (Yang 2007). The same prior (eq. 2) was used for all data 
sets. Equal codon frequencies are assumed in the model 
(Fequal model). 

Figures 2 and 3 show the histograms (smoothed densities) 
of posterior mean estimates and MLEs of t and co. As we see in 
figure 2, ML and Bayesian results are nearly identical for all 
combinations of co = 0.1 and 0.5 and t = 0.5 and 1. However, 
for co = 0.01, Bayesian estimates of co are shifted to the right 
(too large) for all t values, as the prior for co has a mean of 0.5 
and affects the posterior estimates. For co = 2, posterior esti- 
mates of co are shifted to the left (too small) due to the prior. 
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Fig. 1. Contour plots of log-likelihood (A-E) and log-posterior (A'-E') densities for co and t for five synthetic pairwise sequence alignments of 100 
codons. The dashed lines indicate the MLE. Five cases are analyzed: I. normal sequences (A and A), II. identical sequences (B and B'), III. sequences with 
only synonymous changes (C and C), IV. with only nonsynonymous changes (D and D'), V. random sequences (E and E'). 



Generally, both methods behave best (estimates are more 
concentrated around the true value) for intermediate dis- 
tances (t = 0.5 and 1), because sequences of moderate diver- 
gences are the most informative. The estimates of t show 
similar patterns (fig. 3). Although for t = 0.5 and 1 the 



Bayesian estimates are almost identical to the MLEs, for 
t = 0.1 Bayesian results are slightly shifted to the right (too 
large) and for t = 5 they are shifted to the left (too small). 

The means of the Bayesian and ML estimates, the square 
root of the MSE (VMSE), and the 2.5% and 97.5% percentiles 
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Fig. 2. Kernel density (smoothed histogram) of MLEs (dashed red) and Bayesian posterior means (solid green) for co in simulated data sets. The true 
values of co and t are shown on the top and left of the plots, respectively. The sequence length is 500 codons. The number of replicates is 10,000. The 
vertical dashed lines correspond to the true values of co. Independent gamma priors are used &>~G(1.1, 2.2), t~ C(1.1, 1.1) (eq. 2). 



of estimates from the 10,000 simulations are presented in 
tables 1 and 2. Those for the ML method are calculated 
after the infinite estimates are removed. We see that for 
highly similar (t = 0.1) and highly divergent (t = 5) sequences, 
the prior has a noticeable impact. For example, when t = 0.1 
the mean of Bayesian estimates of co is 0.02 when the true 
co = 0.01 and is 1.591 when the true co = 2.0. The mean MLEs 
are in comparison closer to the true values than the means of 
Bayesian estimates. However, the means for the MLEs are 
calculated after data sets in which co = oo are excluded, 
whereas those same data sets are included in the calculation 
of the Bayesian estimates. Similar patterns are observed con- 
cerning estimates oft. Moreover, for small and intermediate co 
and t, ML and Bayesian methods have similar MSE, but for 
large co and t, the Bayesian has smaller MSE indicating that in 
those cases Bayesian estimates are preferable to the MLEs. 

We also considered a test of positive selection, indicated by 
co > 1. For ML, a LRT is used to compare H 0 : co = 1 against 
co > 1, at the 5% significance level. In the Bayesian framework, 
we require the posterior probability to exceed the threshold 
P(co > 1 | x) > 0.95. For the true co = 0.01, 0.1, 0.5, no data sets 
showed significant positive selection by either method. When 
the true co - 2 and t = 0.5, 1, 5, both methods correctly detect 
positive selection in almost 100% of the replicate data sets, so 
that the power of detecting positive selection is high in both 
methods but with the LRT having more power (table 1). 



When co = 2 and t = 0.1, positive selection is detected in 
35% and 61% of data sets by the Bayesian and ML methods, 
respectively. In this case, given the short sequence distance, 
the prior has quite some impact on the ability of the Bayesian 
method to detect selection. In particular, the prior mean 
(co - 0.5) is smaller than the true value (co - 2), so that co is 
shrunk away from 1. 

Analysis of Real Data 

We applied both ML and Bayesian methods to estimate co 
and t for pairwise alignments of protein-coding genes from 
four mammalian genomes (human, chimpanzee, mouse, and 
rat) and from three bacterial genomes (E. coli 0157:H7, £ coli 
K-12, and S. typhimurium LT2). In all analyses, the codon 
frequencies were estimated by using the observed codon fre- 
quencies in the genes (the F61 model). 

Analysis of the Mammalian Data Set 
We conducted three sets of pairwise comparisons: human 
versus chimpanzee, human versus mouse, and mouse versus 
rat. Figure 4 shows the distributions (smoothed histograms) 
of posterior means and the MLEs of t and co in those com- 
parisons. In the human-chimpanzee comparison, the 
Bayesian co estimates are slightly shifted to the right com- 
pared with the MLEs for low co values and shifted to the left 
for high co values. The mean, median, and 25% and 75% 
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Fig. 3. Kernel density (smoothed histogram) of MLEs (dashed red) and Bayesian posterior means (solid green) for t in simulated data sets. Details as in 
figure 2. 



Table 1. Summary Statistics of Bayesian (top, underlined) and ML (bottom) Estimates of co from 10,000 Simulated Data Sets. 



co = 0.01 



co — 0.1 



co — 0.5 



oo = 2 



Mean 


Vmse 


2.5% 


97.5% 


N 0 


Mean 


Vmse 


2.5% 


97.5% 


Mean 


Vmse 


2.5% 


97.5% 


Noo 


Mean 


Vmse 


2.5% 


97.5% 


Noo 


P + 


0.020 


0.014 


0.007 


0.044 


0 


0.118 


0.045 


0.052 


0.214 


0.543 


0.160 


0.301 


0.904 


0 


1.591 


0.546 


0.966 


2.359 


0 


35.1 


0.011 


0.009 


0 


0.033 


2861 


0.103 


0.039 


0.041 


0.194 


0.528 


0.172 


0.278 


0.936 


0 


2.365 


1.484 


1.015 


5.626 


3 


60.7 


0.012 


0.005 


0.005 


0.021 


0 


0.104 


0.018 


0.072 


0.141 


0.511 


0.076 


0.379 


0.677 


0 


1.878 


0.329 


1.360 


2.543 


0 


98.3 


0.010 


0.004 


0.003 


0.019 


15 


0.101 


0.018 


0.069 


0.138 


0.506 


0.076 


0.374 


0.674 


0 


2.064 


0.424 


1.409 


3.031 


0 


98.9 


0.011 


0.003 


0.006 


0.018 


0 


0.102 


0.014 


0.076 


0.132 


0.506 


0.062 


0.397 


0.637 


0 


1.922 


0.278 


1.466 


2.497 


0 


99.9 


0.010 


0.003 


0.005 


0.017 


0 


0.100 


0.014 


0.075 


0.130 


0.503 


0.062 


0.393 


0.635 


0 


2.038 


0.326 


1.508 


2.764 


0 


100 


0.014 


0.005 


0.009 


0.022 


0 


0.129 


0.038 


0.089 


0.183 


0.526 


0.109 


0.348 


0.755 


0 


1.876 


0.374 


1.331 


2.642 


0 


97.4 


0.010 


0.005 


0 


0.019 


370 


0.101 


0.034 


0.037 


0.171 


0.515 


0.981 


0.226 


0.762 


44 


2.120 


1.398 


1.400 


3.228 


0 


98.6 



t = 0.1 



t = 0.5 



t=1 



t=S 



Note. — The Fequal model is used for codon frequencies. Results for ML have been calculated after removing infinite estimates. For &> = 0.1, there were no data sets with 0 or 
infinite estimates. N 0 is the number of replicates with co-0, whereas is the number of replicates with co = co. P+ is the proportion of replicates with significant evidence for 
positive selection indicated by P{co > 1 | x) > 0.95 in the Bayesian method or by a significant LRT at the 5% level (one-sided with critical value 2.71) in the likelihood method. 



percentiles of the Bayesian estimates are 0.369, 0.320, and 
(0.180, 0.500) whereas those of the MLEs are 0.307, 0.193, 
and (0.062, 0.411) (table 3). The human and chimpanzee 
genes are very similar and the patterns are similar to those 
observed in computer simulation for low t values. Moreover, 
there are 377 and 2,507 gene alignments in which t = 0 and 
co = 0, respectively, as well as 2 and 423 alignments where 
t = oo and co = oo, respectively. The Bayesian method does 



not produce any such extreme estimates. The number of 
genes in which the co estimate is >1 is 1,121 for ML and 
299 for the Bayesian method (table 4). The discrepancy is 
the result of two effects, a short evolutionary distance and 
a short sequence length, both indicating a lack of information 
and greater influence from the prior. Genes with co > 1 tend 
to be small (median sequence length 313 codons, compared 
with 454 codons for all genes). For example, one gene among 
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Table 2. Summary Statistics of Bayesian (top, underlined) and ML (bottom) Estimates of t from 10,000 Simulated Data Sets. 









t = 


0.1 






t = 


0.5 






t = 


1 








t= 5 










Mean 


Vmse 


2.5% 


97.5% 


Mean 


Vmse 


2.5% 


97.5% 


Mean 


Vmse 


2.5% 


97.5% 


Mean 


Vmse 


2.5% 


97.5% 


Noo 


(') 


= 0.01 


0.102 
0.100 


0.015 
0.015 


0.074 
0.072 


0.134 
0.132 


0.504 
0.503 


0.045 
0.045 


0.421 
0.419 


0.596 
0.595 


1.013 
1.011 


0.100 
0.100 


0.837 
0.836 


1.223 
1.222 


3.910 
7.572 


1.322 
8.922 


2.600 
2.676 


5.506 
43.744 


0 

244 


CO 


= 0.1 


0.102 


0.015 


0.075 


0.133 


0.503 


0.041 


0.427 


0.587 


1.007 


0.077 


0.865 


1.171 


4.406 


0.869 


3.317 


5.795 


0 


0.100 


0.015 


0.073 


0.131 


0.502 


0.041 


0.425 


0.585 


1.006 


0.077 


0.864 


1.170 


5.629 


2.700 


3.373 


11.506 


24 


0) 


= 0.5 


0.102 


0.015 


0.075 


0.132 


0.503 


0.036 


0.436 


0.574 


1.004 


0.057 


0.895 


1.118 


5.158 


1.469 


4.249 


6.368 


0 


0.100 


0.015 


0.073 


0.130 


0.501 


0.036 


0.434 


0.572 


1.002 


0.057 


0.894 


1.116 


5.440 


2.601 


4.228 


7.979 


43 






0.102 


0.015 


0.075 


0.131 


0.501 


0.035 


0.434 


0.571 


1.001 


0.056 


0.895 


1.112 


4.988 


0.737 


4.274 


6.035 


0 


CO 


= 2 


0.100 


0.014 


0.073 


0.129 


0.500 


0.035 


0.433 


0.571 


1.002 


0.056 


0.895 


1.114 


5.119 


0.726 


4.323 


6.401 


3 



Note. — The Fequal model is used for codon frequencies. Results for ML have been calculated after removing the infinite estimates. For t = 0.1, 0.5, and 1, there were no data sets 
with 0 or infinite estimate. is the number of replicates with cb-oo. 



those 1,121 with cb > 1 has cb- 1.22 (95% confidence inter- 
val— CI 0.37-4.01) and posterior mean £ = 0.93 (95% credi- 
bility interval — CI 0.36-2.43). This gene has a length of 262 
codons and has a small evolutionary distance with t = 0.043 
(95% CI 0.024-0.077) and t = 0.047 (95% CI 0.027-0.082), so 
that the prior has an impact. Another gene has cb = 1.27 (95% 
CI 0.75-2.1 6) and cb = 1 .1 3 (95% CI 0.60-2.1 3). This gene is 257 
codons in length and the ML and Bayesian distance estimates 
are 0.17 (95% CI 0.13-0.24) and 0.18 (95% CI 0.13-0.24), re- 
spectively. The second gene has a similar length to the first 
but because the sequence distance is greater, the prior is 
much less important. In a third gene, of length 1,019 
codons, the MLEs are t = 0.041 (95% CI 0.030-0.056) and 
cb= 1.27 (95% CI 0.77-2.07), compared with the Bayesian es- 
timates t = 0.042 (95% CI 0.031-0.057) and cb = 1.13 (95% CI 
0.59-2.14). In this case, the effect of the prior is unimportant, 
because the gene is long. 

Among the 1,121 genes with cb > 1 only 78 have sta- 
tistically significantly evidence of positive selection, based 
on the LRT (a - 5%) (table 4). All the 78 genes have the 
posterior mean cb>1. Moreover, out of them, three 
showed strong evidence of positive selection in the 
Bayesian analysis, with P(co > 1 | x) > 0.95 (table 4). 
The difference (78 vs. 3 genes) in the number of genes 
with co > 1 between the ML and the Bayesian method is 
consistent with the general expectation that the LRT 
tends to reject the null more readily than the Bayesian 
analysis. It is also consistent with the results observed in 
the computer simulations for t = 0.1 and co = 2. We note 
that the three genes significant in the Bayesian analysis 
have fairly large sequence divergences, with t « 0.1, 
whereas the other 75 genes (for which the LRT is signif- 
icant but the Bayesian evidence is not strong) have highly 
similar sequences, with t < 0.07 (with median 0.021). 

In the human-mouse comparison, the ML and Bayesian 
estimates are very similar. The sequence divergence is inter- 
mediate, the data are informative, and the prior does not have 
a noticeable impact. There are very few cases where the MLEs 
are extreme (0 or oo). Also, the number of genes showing co 
estimates >1 are nearly the same between the two methods 
(7 vs. 6) and the same two genes show significant evidence for 
positive selection by both methods. The mouse-rat 



comparison shows similar patterns to the human-mouse 
comparison: in both cases, the sequences are moderately di- 
vergent and the data are informative. 

To examine the sensitivity of posterior estimates of t and co 
to the prior, we reanalyzed the human-chimpanzee and 
human-mouse alignments using two alternative priors: AP1 
and AP2. The first alternative prior (AP1) is t~G(2, 2) and 
co ~ C(2, 4). This has the same means as the default prior of 
equation (2) but the prior is more informative because of the 
larger shape parameter (2 vs. 1.1). In the second alternative 
prior (AP2), we used 2 for the shape parameter, but chose the 
rate parameter such that the prior mean roughly matches the 
median of the MLEs for all genes (table 3). Thus, for the 
human-chimpanzee comparison, AP2 is t~G(2, 100), with 
the prior mean 0.02 (while the median of MLEs oft is 0.016), 
and co ~ C(2, 10), with the prior mean 0.2 (while the median 
of MLEs of co is 0.193). For the human-mouse comparison, 
AP2 specifies t ~ C(2, 3), with the prior mean 0.67 (while 
the median of the MLEs is 0.686) and co ~ C(2, 20), with the 
prior mean 0.1 (the median of the MLEs is 0.089). While it is 
in general not advisable to use the data to specify the prior, 
we note that in specific comparisons, some prior information 
may be available. For example, between the human and 
the chimpanzee, the distance t is very likely to be smaller 
than 0.1. 

Posterior estimates of co and t from the analysis using the 
default and alternative priors are illustrated in figures 5 and 6. 
In the human-chimpanzee comparison, the impact of the 
prior is apparent. The Bayesian co estimates using the AP1 are 
higher than those using the default prior for low co values 
(co < 0.5) and lower for high co values (co > 0.5) (fig. 5A). With 
a more informative prior (shape parameter 2), the posterior 
means are closer to the prior mean 0.5. For the human- 
mouse comparison estimates under AP1 are close to those 
under the default prior (fig. 5B). The Bayesian estimates of t 
are less affected by the change in the prior in both 
comparisons and the estimates are approximately the same 
for the majority of the genes (fig. 6A and B). Prior AP2 has a 
more significant effect. In both comparisons, the Bayesian 
estimates of co are smaller than those obtained using the 
default prior for almost all genes (fig. 5C and D). The priors 
are more informative (with shape parameter a = 2) and have 
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Fig. 4. Distributions (smoothed histograms) of Bayesian and ML estimates of t and co from mammalian and bacterial pairwise gene comparisons. 
Numbers of genes analyzed in each comparison are shown in the right part of the figure. 



lower means (0.2 and 0.1 for the human-chimpanzee and 
human-mouse comparisons, respectively, instead of 0.5) and 
thus affect posterior estimates more than the default prior. 
The effect is more apparent in the human-chimpanzee 
comparison because of the smaller sequence distances. 



Posterior estimates of t are less affected by the change 
in the prior (fig. 6C and D). In summary, the prior affects 
posterior estimates of co when the genes are not infor- 
mative about co and does not affect significantly the pos- 
terior estimates of t. 
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Table 3. Descriptive Statistics of Bayesian (top, underlined) and ML (bottom) Estimates of t and co from Pairwise Comparisons of Protein-Coding 
Genes from Mammalian Species and Bacterial Strains. 



co t 





No. of Genes 


Mean 


SD 




Quartiles 




N 0 




Mean 


SD 




Quartiles 




N 0 




25% 


50% 


75% 


25% 


50% 


75% 


H uman -ch i m panzee 


14,215 


0.369 


0.246 


0.180 


0.320 


0.500 


0 


0 


0.025 


0.072 


0.013 


0.019 


0.028 


0 


0 




0.307 


0.418 


0.062 


0.193 


0.411 


2507 


423 


0.022 


0.042 


0.010 


0.016 


0.025 


377 


2 


Human-mouse 


14,624 


0.130 


0.125 


0.044 


0.093 


0.176 


0 


0 


0.812 


0.574 


0.503 


0.691 


0.958 


0 


0 




0.126 


0.157 


0.040 


0.089 


0.170 


221 


0 


0.849 


1.252 


0.499 


0.686 


0.952 


0 


30 


Mouse-rat 


13359 


0.168 


0.168 


0.055 


0.118 


0.228 


0 


0 


0.242 


0.179 


0.163 


0.215 


0.281 


0 


0 




0.159 


0.180 


0.046 


0.108 


0.215 


509 


0 


0.238 


0.232 


0.161 


0.212 


0.278 


0 


3 


Escherichia coli 




0.179 


0.170 


0.055 


0.116 


0.252 


0 


0 


0.080 


0.354 


0.026 


0.043 


0.068 


0 


0 


K-1 2-E.coli 01 57 


2,619 


0.099 


0.174 


0.001 


0.034 


0.110 


912 


31 


0.073 


0.527 


0.020 


0.038 


0.064 


121 


6 


E. coli KAl-Salmonella 


2,619 


0.037 


0.042 


0.016 


0.025 


0.042 


0 


0 


2.261 


1.546 


1.153 


1.836 


3.129 


0 


0 


typhimurium LT2 


0.025 


0.042 


0.006 


0.018 


0.032 


164 


0 


5.052 


8.481 


1.087 


1.748 


4.066 


0 


217 



Note. — The F61 model is used for codon frequencies. Results for ML have been calculated after removing the infinite estimates. N 0 is the number of genes with the MLE co or 
i-0, whereas is the number of genes with the MLE co or t = co. 



Table 4. The Numbers of Genes with co Estimate Greater or Less 
than 1 Using the Bayesian and ML Methods. 



Data 






Bayesian 

O) < 1 O) > 1 


N L 


H uman-ch i m panzee 




cb< 1 


13,094 


0 


78 






6) > 1 


822 


299 








N B 




3 




Human-mouse 




O) < 1 


14,617 


0 


2 






d) > 1 


1 


6 








N B 




2 




Mouse-rat 


ML 


6j < 1 


13,313 


0 


5 






O) > 1 


10 


36 








N B 




2 




Escherichia coli K-12-E. 




d) < 1 


2,574 


0 


0 


Coli 0157 




O) > 1 


43 


2 








N B 




0 




£. coli KAl-Salmonella 




6) < 1 


2,617 


0 


0 


typhimurium LT2 




O) > 1 


2 


0 








N B 




0 





Note. — N L is the number of genes with statistically significant co > 1 based on the 
LRT at the 5% level (one-sided with critical value 2.71) in the likelihood method 
whereas N B is the number of genes with P(oo > 1 | x) > 0.95 in the Bayesian 
analysis. 



Analysis of the Bacterial Data Set 

We conduct two pairwise comparisons: E. coli K-1 2 versus 
E. coli 0157:H7 and E. coli K-1 2 versus S. typhimurium 
LT2. Note that the two strains of E. coli have the same 
evolutionary distance from the Salmonella. 

The sequences from the two E. coli strains are very 
similar, and the prior has an impact on Bayesian esti- 
mates, similar to the comparison of the human and chim- 
panzee genes. The mean, median, and 25% and 75% 
percentiles of the Bayesian co estimates are 0.179, 0.116, 
and (0.055, 0.252) while the corresponding results for the 
MLEs are 0.099, 0.034, and (0.001, 0.110). The two meth- 
ods are thus very different in analysis of those genes. Also, 
the MLE co = 0 in 912 genes and co = 00 in 31 genes. 



None of the genes with co > 1 is statistically significant 
at the a - 5% significance level according to the LRT and 
none has P(co > 1 | x) > 0.95 (table 4). The gene se- 
quences from the E. coli K-1 2 and Salmonella are quite 
divergent. In most genes, the two methods produced 
similar estimates (fig. 4). However, some genes are very 
divergent with the MLE t = oo in 217 genes. 

Discussion 

We suggest that if possible one should conduct joint 
comparative analysis of multiple protein-coding gene se- 
quences on a phylogeny, instead of pairwise comparisons. 
In particular, a number of LRTs have been developed to 
detect positive selection that affects particular evolution- 
ary lineages on the phylogeny or individual sites in the 
protein (see, e.g., Yang [2006a] and Cannarozzi and 
Schneider [2012], for reviews). To apply such tests of 
positive selection, it is essential to use multiple sequences, 
as a pair of sequences hardly contains enough information 
for the tests to have any power (e.g., Yang 2006b). Some 
proteins may evolve in an episodic manner and thus 
adaptive episodes may not be detected in pairwise com- 
parisons, especially when the sequences are distantly re- 
lated (Messier and Stewart 1997). In a pairwise 
comparison, positive selection is detected only if the co 
averaged over all sites in the protein and over the whole 
evolutionary history connecting the two sequences is >1. 
This seems to be an extremely stringent criterion. Analysis 
of multiple sequences on a phylogeny allows one to 
detect episodic positive selection that affects a particular 
branch (Yang 1998). 

Nevertheless, we note that pairwise sequence comparisons 
are widely used, especially in comparative genomics, 
sometimes to provide summary statistics of the data and 
sometimes because of lack of a third genome. The ML 
method has been used to estimate co and t in pairwise 
comparisons of genes (e.g., Nielsen et al. 2005; Ge et al. 
2008; Walters and Harrison 2010; Buschiazzo et al. 2012; 
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Posterior mean of co 
Default prior = 
co- G(1. 1,2.2), f~ G(1.1, 1.1) 

Fig. 5. Bayesian estimates of co for the human-chimpanzee (A and C) and human-mouse (B and D) comparisons using two alternative priors plotted 
against estimates using the default prior (eq. 2). The alternative priors are: (A and B) co~ G(2, 4), t~ G(2, 2); (C) co~ G(2, 10), t~ G(2, 100); and (D) 
co ~G(2, 20), t~G(2, 3). 



Gladieux et al. 2013; Wang and Chen 2013). Counting 
methods are also used due to their simplicity (Garcia- 
Gil et al. 2003; Schenekar et al. 2011; Graves et al. 
2013), even though they were found not to perform as 
well as ML in computer simulations (Yang and Nielsen 
2000). Both counting and ML methods sometimes return 
0 or oo as estimates, so that neither the expectation nor 
the variance of the estimates is finite. The infinity esti- 
mates of co appear to be particularly confusing to many 
users of the methods. To avoid such extreme estimates, 
some authors (e.g., Novaes et al. 2008; Bajgain et al. 2011; 
Pellino et al. 2013) added a small arbitrary number (pseu- 
docounts) to the numbers of synonymous and nonsynon- 
ymous substitutions before calculating co. Other authors 
excluded genes with d s = 0 from their analysis (e.g., Wang 
and Chen 2013). The Bayesian method implemented here 
may provide a better procedure than such ad hoc treat- 
ments. It always returns finite estimates of co and t as the 
prior penalizes extreme values. Our computer simulation 
suggests that the Bayesian estimates of co have nice sta- 
tistical properties, with similar or smaller MSEs compared 
with the MLEs. The posterior means are close to the 



MLEs when the data are informative, that is, when the 
sequences are long and the sequence divergence is inter- 
mediate, but the differences can be large when the se- 
quences are short and are either too similar or too 
divergent. Nearly identical sequences contain little infor- 
mation while extremely divergent sequences contain too 
much noise concerning co. In both cases, the data are not 
informative and the prior has an impact on posterior 
estimates of co. However, as sequence length increases 
the effect of the prior decreases irrespective of the true 
values of co and t. Our Bayesian method is used for the 
analysis of only two sequences. A Bayesian method for 
the analysis of multiple sequences in a phylogeny requires 
calculation of high-dimensional integrals and is not pur- 
sued here. 

We emphasize that MLEs co = oo should not be taken 
as evidence for positive selection (co> 1) because the ex- 
treme estimate may well be due to chance effects when 
the numbers of changes are small. Instead, positive selec- 
tion can be claimed only if the LRT is significant in the 
ML framework or when P(co > 1 | x) > 0.95 in the 
Bayesian analysis. 
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Fig. 6. Bayesian estimates of t for the human-chimpanzee (A and C) and human-mouse (B and D) comparisons using different gamma priors. The 
alternative priors are as in figure 5. 



Program Availability 

The Bayesian method of this article is implemented in the 
CODEAAL program in the PAML package. The program allows 
the user to specify gamma priors for t and co. Although the 
Bayesian method is computationally more intensive than ML, 
it remains fast enough for large-scale screening. It takes 1-2 s 
to analyze one pair of sequences on a modern PC. 

Methods and Materials 

Theory 

We use a simplified version of the model of Goldman and 
Yang (1994) to model the evolution of codon sequences 
(Yang and Nielsen 1998). The model accounts for the genetic 
code structure, the transition/transversion rate ratio, the 
codon frequencies as well as the d N /d s rate ratio co. The in- 
stantaneous substitution rate from codon / to codon j (/ ^ j) 
is given by 



<?f = 



0, if / and j differ at two or three codon positions, 
77), if / and j differ by a synonymous transversion, 
K7tj, if / and j differ by a synonymous transition, 
coTtj, if / and j differ by a nonsynonymous transversion, 
coKTtj, if / and j differ by a nonsynonymous transition, 

(5) 



where Tr, is the equilibrium frequency of codon j. Stop co- 
dons are not considered (they are assumed not to occur 
within protein-coding genes). Therefore, the substitution 
rate matrix Q = {q$ is of size 61 x 61 for the standard ge- 
netic code. The rate matrix is scaled so that the average 
rate of codon substitution equals — ^Zfl-, it -flu = 1, and 
thus time is measured by the expected number of nucleotide 
substitutions per codon site. We use standard theory to cal- 
culate the transition probability matrix over time t as 
P(t) = exp(Qt). The likelihood function on a pairwise sequence 
alignment x is 

ic 

f(x\t,K,CO) = y[7tiPij(t), (6) 
h=1 

where / and j are the observed codons in the two sequences at 
site h and L c is the number of codons. 

The joint posterior distribution of co and t is given by 
equation (1). If k is a parameter in the model we replace it 
with its AALE (/?). If the two sequences are identical so that k is 
not unique, we fix it at 2. Besides the posterior means of co 
and t given in equations (3) and (4), we also calculate the 
posterior variances and covariance 

Va r (a>\x) = E(co 2 \x)-[E(co\x)f, (7) 
Var(t|x) = £(t 2 |x)-[£(t|x)] 2 , (8) 
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Cov(tt), t\x) = E(a)t\x) - E(a> \ x)E(t | x). 



(9) 



Thus, six double integrals need to be computed, one for the 
normalizing constant C, and five for the different expectations 
in equations (3), (4), and (7)-(9). 

Consider the calculation of the normalizing constant C All 
other integrals are calculated in the same way. We write 
g(t,co) =f(x | t,co)f(t,co). To avoid overflows and underflows, 
we set h(t,ft>) = exp{log[g(fc co)]-l m J, where / max is the 
maximum of g(t, co), a constant chosen for scaling. The nor- 
malizing constant can then be written as 



C = exp(/ max ) 



oo oo 



co)dtdco. 



(10) 



We use the Gaussian quadrature method to calcu- 
late all integrals numerically, which uses Legendre polyno- 
mials to approximate any continuous integrand function 

fix, y): 



I I 

// 



/(x,y)dxdy^ 



01) 



The weights w, and Wj and the points x, and at which the 
integrand is evaluated are predetermined given the total 
number of points n. In our case, the limits of the integrals 
are 0 and oo and we have to use a transformation to map the 
(0, oo) limits to (—1, 1). A much more serious problem is that 
the integrand g may be spiky (i.e., it is highly concentrated in a 
very small interval) and the approximation will be very poor if 
the sampled points miss the spike in the integrand. The ra- 
tionale behind our transformation is to find a probability 
density function (PDF) that has a similar shape to the inte- 
grand g(t,co) and then we use its cumulative distribution 
function (CDF) to transform the integrand. Note that if the 
chosen PDF matches the posterior exactly, the new integrand 
will become perfectly flat after the transformation. The logis- 
tic distribution is used for that purpose. 

Let x^logt ~ Logistic^, a q ) and x 2 = logo; ~ Logistic 
(/z 2 , o 2 ). For any random variable x ~ Logistic(/x, a) the 
CDF is F L (x) = 1+e _ 1 (x _ M)/CT . Thus, for equation (10), we use 
the following transformation (change of variables): 

z, = 2F L (x 1 ) - 1 =>► t = expj/XT + oi log^^J, (12) 

z 2 = 2F L (x 2 )-1 co = exp|/x 2 + a 2 log ^ j. (13) 
Thus, equation (10) becomes 



C = exp(/ max ) j j r(z 1 ,z 2 )dz 1 dz 2 

(14) 



exp(/ max ) ^ w i w j r(zi i ,z 2j ), 

'V=1 



where r(z 1? z 2 ) = h(t, co) f°\ ^ and t and co are given by 

i z 1 I z 2 

equations (12) and (13), respectively. We transform all other 
integrals in equations (3), (4), and (7)-(9) in the same way. 
Thus, we have 



E(co |x) « \ W}WjCO}r(z Vi ,z 2j ), 

n 

E(t | x) «± WjWjtjr(z Vi ,z 2j ), 

n 

E(co 2 \x)^±J2 WiWjCo}r(z Vi , z 2 .), / q 5 ) 

1 n 

E(tco | x) ^ - ^2 WiWjCo } tjr(z Vi ,z 2j ), 
A 'V=i 

where A = Cexp(— / max ). Notice that the exponential term 
exp(/ max ) cancels out during calculations. 

Our Bayesian calculation is performed after the MLEs are 
obtained. Thus if both t and co are finite, away from 0 and the 
observed p s and p N (proportion of synonymous differences 
per synonymous site and proportion of nonsynonymous 
differences per nonsynonymous site, respectively) are 

<0.74, we set = logt, /i 2 - logci, = (j)^Jv(t)> an ^ 



o 2 = (VjJ V(co). The variances V(t) and V(<i>) are estimated 

using the Nei and Gojobori (1986) method. Because the Nei 
and Gojobori method uses the Jukes and Cantor (1969) 
nucleotide substitution model (JC69) to correct for multiple 
hits, the use of 0.74 as an upper limit for the p s and p N 
guarantees an adequate estimation of V(t) and V(co). 

In all other cases, we find numerically the point (t, co) that 
maximizes log{g(t, co)}. We calculate the Hessian matrix at this 
point using the second-order difference method and use the 
inverse of the Hessian to estimate the variances V(t) and 

\f(co). Then, we set /i^ = logt, /i 2 - logcZ), = (|)^ 



and o 2 = (£j^jv{co). Notice that because of our choice of 

the prior, log(g) always has a mode and thus the optimization 
algorithm returns a point away from (0, 0). 

We use the same number of points n for both parameters 
co and t in the Gaussian quadrature. With n = 32, each sum 
in equation (15) requires 32 x 32= 1,024 evaluations of the 
r(z v z 2 ) function. Tests suggest that using 32 points achieves 
high accuracy. The use of more points increases the 
computational time radically since evaluation of r(z v z 2 ) re- 
quires evaluation of the likelihood which is computationally 
expensive. Moreover, we use the same techniques 
described above to calculate the posterior probability 

oo oo 

P(co > 1 | x) = £ f f f(x 1 1, co, k) f(t, co)dcodt, as a Bayesian 
o 1 

equivalent of the LRT for positive selection indicated by co > 1. 
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Real Data Analysis 

Both the new Bayesian method of this article and the ML 
method of Goldman and Yang (1994) were applied to com- 
pare protein-coding genes from mammalian species and bac- 
terial strains. The mammalian data set is a subset of the data 
analyzed by dos Reis et al. (2012). There are 14,218 genes from 
the human and chimpanzee, with the sequence length rang- 
ing from 39 to 8,797 codons; 14,631 genes from the human 
and mouse with the sequence length from 13 to 8,787 
codons; and 13,371 genes from the mouse and rat with the 
sequence length from 14 to 7,798 codons. The protein-coding 
sequences from the genomes of £ coli 0157:H7, £ coli 
K-12, and S. typhimurium LT2 were downloaded from 
GenBank (accession numbers: U_00096, NC_002655, and 
NC_003197). Orthologous genes among the three genomes 
were identified by using the program BLAT (Kent 2002) to 
extract the best reciprocal hits. Only orthologs present in all 
three genomes are used. This bacterial data set consists of 
2,631 genes from each strain, with the sequence length rang- 
ing from 20 to 1,485 codons. Codons involving alignment 
gaps and ambiguity nucleotides were removed prior to ana- 
lyses. Moreover, genes with sequence length of 50 codons 
or less were excluded from the analysis. The number of 
genes analyzed in each comparison is reported in table 3 
and figure 4. 
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